Please knit this file with HTML as it has 3D plots which can be visualized in HTML Page only.
This report analyzes COVID-19 data from 2020 through 2023. The dataset, compiled by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, documents the impact of COVID-19 across the United States and globally.
Data Source: - https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
GitHub - https://github.com/Cosmic-AJ/Mini-Projects/tree/main/COVID-19%20Data%20Report
library(readr)
library(tidyverse)
library(party)
library(caret)
library(e1071)
library(usmap)
library(ggplot2)
library(rworldmap)
library(plotly)
library(nnet)
library(usmap)
library(ggnewscale)
The data are sourced directly from publicly available GitHub CSV files maintained by Johns Hopkins University.
url_in = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/refs/heads/master/csse_covid_19_data/csse_covid_19_time_series/'
filesnames = c('time_series_covid19_confirmed_US.csv',
'time_series_covid19_confirmed_global.csv',
'time_series_covid19_deaths_US.csv',
'time_series_covid19_deaths_global.csv',
'time_series_covid19_recovered_global.csv'
)
urls = str_c(url_in, filesnames)
us_cases = read_csv(urls[1])
## Rows: 3342 Columns: 1154
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1148): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_cases = read_csv(urls[2])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
us_deaths = read_csv(urls[3])
## Rows: 3342 Columns: 1155
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1149): UID, code3, FIPS, Lat, Long_, Population, 1/22/20, 1/23/20, 1/24...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_deaths = read_csv(urls[4])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_recovered = read_csv(urls[5])
## Rows: 274 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
uid_lookup_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/refs/heads/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
uid = read_csv(uid_lookup_url) %>%
select(-c(Lat, Long_, Combined_Key, code3, iso2, iso3, Admin2))
## Rows: 4321 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): iso2, iso3, FIPS, Admin2, Province_State, Country_Region, Combined_Key
## dbl (5): UID, code3, Lat, Long_, Population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Unnecessary columns are removed and data is reshaped to facilitate time series and spatial analyses. Feature engineering is used to derive new summary statistics.
global_cases = global_cases %>%
pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
names_to='date',
values_to='cases') %>%
select(-c(Lat, Long))
global_deaths = global_deaths %>%
pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
names_to='date',
values_to='deaths') %>%
select(-c(Lat, Long))
global_recovered = global_recovered %>%
pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
names_to='date',
values_to='recovered') %>%
select(-c(Lat, Long))
us_cases = us_cases %>%
pivot_longer(cols=-(UID:Combined_Key),
names_to='date',
values_to='cases') %>%
select(Admin2:cases) %>%
mutate(date=mdy(date)) %>%
select(-c(Lat, Long_))
us_deaths = us_deaths %>%
pivot_longer(cols=-(UID:Population),
names_to='date',
values_to='deaths') %>%
select(Admin2:deaths) %>%
mutate(date=mdy(date)) %>%
select(-c(Lat, Long_))
Data tables are joined to combine cases, deaths, and recovered counts.
global = global_cases %>%
full_join(global_deaths) %>%
full_join(global_recovered) %>%
rename(Country_Region='Country/Region',
Province_State ='Province/State') %>%
mutate(date=mdy(date))
## Joining with `by = join_by(`Province/State`, `Country/Region`, date)`
## Joining with `by = join_by(`Province/State`, `Country/Region`, date)`
global = global %>%
filter(cases>0)
us = us_cases %>%
full_join((us_deaths))
## Joining with `by = join_by(Admin2, Province_State, Country_Region,
## Combined_Key, date)`
Further join and cleanup for global data
global = global %>%
unite('Combined_Key',
c(Province_State, Country_Region),
sep=",",
na.rm=TRUE,
remove=FALSE)
global = global %>%
left_join(uid, by=c("Province_State", "Country_Region")) %>%
select(-c(UID, FIPS)) %>%
select(Province_State, Country_Region, date, cases, deaths, recovered, Population, Combined_Key)
Aggregate at the US state and national levels, and compute new daily cases/deaths
us_by_state = us %>%
group_by(Province_State, Country_Region, date) %>%
summarize(cases=sum(cases), deaths=sum(deaths),
population=sum(Population)) %>%
mutate(deaths_per_mill=deaths*1000000/population) %>%
select(Province_State, Country_Region, date, cases, deaths, deaths_per_mill, population) %>%
ungroup()
## `summarise()` has grouped output by 'Province_State', 'Country_Region'. You can
## override using the `.groups` argument.
us_totals = us_by_state %>%
group_by(Country_Region, date) %>%
summarize(cases=sum(cases), deaths=sum(deaths),
population=sum(population)) %>%
mutate(deaths_per_mill=deaths*1000000/population) %>%
select(Country_Region, date, cases, deaths, deaths_per_mill, population) %>%
ungroup()
## `summarise()` has grouped output by 'Country_Region'. You can override using
## the `.groups` argument.
max(us_by_state$date)
## [1] "2023-03-09"
min(us_by_state$date)
## [1] "2020-01-22"
us_by_state = us_by_state %>%
mutate(new_cases = cases-lag(cases),
new_deaths=deaths-lag(deaths))
us_totals = us_totals %>%
mutate(new_cases = cases-lag(cases),
new_deaths= deaths-lag(deaths))
tail(us_totals %>% select(new_cases, new_deaths, everything()))
## # A tibble: 6 × 8
## new_cases new_deaths Country_Region date cases deaths deaths_per_mill
## <dbl> <dbl> <chr> <date> <dbl> <dbl> <dbl>
## 1 2147 7 US 2023-03-04 1.04e8 1.12e6 3371.
## 2 -3862 -38 US 2023-03-05 1.04e8 1.12e6 3371.
## 3 8564 47 US 2023-03-06 1.04e8 1.12e6 3371.
## 4 35371 335 US 2023-03-07 1.04e8 1.12e6 3372.
## 5 64861 730 US 2023-03-08 1.04e8 1.12e6 3374.
## 6 46931 590 US 2023-03-09 1.04e8 1.12e6 3376.
## # ℹ 1 more variable: population <dbl>
Summarize global COVID-19 statistics at the country level
global_totals = global %>%
group_by(Country_Region, date) %>%
summarize(cases=sum(cases), deaths=sum(deaths),
recovered=sum(recovered),
population=sum(Population)) %>%
mutate(deaths_per_mill=deaths*1000000/population) %>%
select(Country_Region, date, cases, deaths, recovered, deaths_per_mill, population) %>%
ungroup()
## `summarise()` has grouped output by 'Country_Region'. You can override using
## the `.groups` argument.
global_totals = global_totals %>%
mutate(new_cases = cases-lag(cases),
new_deaths=deaths-lag(deaths),
new_recovered=recovered-lag(recovered))
global_totals = global_totals %>%
group_by(Country_Region) %>%
summarize(deaths=max(deaths),
cases=max(cases),
recovered=max(recovered),
population=max(population),
active=cases-recovered,
cases_per_thou = 1000*cases/population,
deaths_per_thou = 1000*deaths/population,
recovered_per_thou = 1000*recovered/population) %>%
filter(cases>0, population>0)
Identify countries with highest/lowest death and recovery rates. It also reflects the country with maximum active cases.
global_totals %>%
slice_min(deaths_per_thou, n=10) %>%
select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
## deaths_per_thou cases_per_thou Country_Region deaths cases recovered
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 35.8 Holy See 0 29 27
## 2 0 240. Tuvalu 0 2828 0
## 3 0.000233 0.0000388 Korea, North 6 1 0
## 4 0.00320 4.51 Burundi 38 53631 773
## 5 0.0118 0.467 Chad 194 7679 4796
## 6 0.0123 1.64 South Sudan 138 18368 10514
## 7 0.0130 0.393 Niger 315 9508 5351
## 8 0.0131 1.86 Tajikistan 125 17786 14867
## 9 0.0134 2.31 Benin 163 27999 8136
## 10 0.0142 0.718 Tanzania 846 42906 183
## # ℹ 3 more variables: population <dbl>, active <dbl>, recovered_per_thou <dbl>
global_totals %>%
slice_max(deaths_per_thou, n=10) %>%
select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
## deaths_per_thou cases_per_thou Country_Region deaths cases recovered
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 6.66 136. Peru 219539 4.49e6 2086086
## 2 5.50 187. Bulgaria 38228 1.30e6 398721
## 3 5.05 227. Hungary 48762 2.20e6 749773
## 4 4.96 122. Bosnia and Herzegovina 16280 4.02e5 189710
## 5 4.64 166. North Macedonia 9662 3.47e5 150440
## 6 4.47 460. Montenegro 2808 2.89e5 99152
## 7 4.38 309. Croatia 17987 1.27e6 354830
## 8 4.25 458. Georgia 16971 1.83e6 390827
## 9 3.97 431. Czechia 42491 4.62e6 1641321
## 10 3.87 491. Slovakia 21035 2.67e6 255300
## # ℹ 3 more variables: population <dbl>, active <dbl>, recovered_per_thou <dbl>
global_totals %>%
slice_min(recovered_per_thou, n=10) %>%
select(recovered_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
## recovered_per_thou cases_per_thou Country_Region deaths cases recovered
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 42.6 Kiribati 18 5014 0
## 2 0 0.0000388 Korea, North 6 1 0
## 3 0 484. Nauru 1 5247 0
## 4 0 333. Palau 9 5991 0
## 5 0 267. Sweden 23777 2699339 0
## 6 0 159. Tonga 13 16810 0
## 7 0 240. Tuvalu 0 2828 0
## 8 0.00306 0.718 Tanzania 846 42906 183
## 9 0.00879 210. Micronesia 61 23948 1
## 10 0.0103 41.0 Vanuatu 14 12014 3
## # ℹ 3 more variables: population <dbl>, active <dbl>, deaths_per_thou <dbl>
global_totals %>%
slice_max(recovered_per_thou, n=10) %>%
select(recovered_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
## recovered_per_thou cases_per_thou Country_Region deaths cases recovered
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 186. 620. Andorra 165 47890 14380
## 2 182. 515. Seychelles 172 50665 17874
## 3 158. 460. Montenegro 2808 288808 99152
## 4 157. 418. Bahrain 1553 710693 267220
## 5 153. 431. Czechia 42491 4618256 1641321
## 6 148. 696. San Marino 122 23616 5009
## 7 139. 344. Maldives 311 185738 75095
## 8 122. 641. Slovenia 7078 1331707 253972
## 9 116. 507. Luxembourg 1220 317367 72306
## 10 108. 298. Uruguay 7617 1034303 374203
## # ℹ 3 more variables: population <dbl>, active <dbl>, deaths_per_thou <dbl>
global_totals %>%
slice_max(active, n=10) %>%
select(active, cases_per_thou, everything())
## # A tibble: 10 × 9
## active cases_per_thou Country_Region deaths cases recovered population
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 97504620 315. US 1123836 103802702 6298082 329466283
## 2 39451607 585. France 166176 39866718 415111 68128061
## 3 34589800 460. Germany 168935 38249060 3659260 83155031
## 4 32467987 263. Japan 72997 33320438 852451 126476458
## 5 30434803 597. Korea, South 34093 30615522 180719 51269183
## 6 24642334 360. United Kingdom 220721 24658705 16371 68403187
## 7 21458902 423. Italy 188322 25603510 4144608 60461828
## 8 19309981 174. Brazil 699276 37081209 17771228 212559409
## 9 16466176 151. Russia 388478 22075858 5609682 145934460
## 10 13715990 32.4 India 530779 44690738 30974748 1380004385
## # ℹ 2 more variables: deaths_per_thou <dbl>, recovered_per_thou <dbl>
The following plot shows cumulative COVID-19 cases and deaths over time in the United States.
us_totals %>%
filter(cases > 0) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = cases, color = "Cases")) +
geom_point(aes(y = cases, color = "Cases")) +
geom_line(aes(y = deaths, color = "Deaths")) +
geom_point(aes(y = deaths, color = "Deaths")) +
scale_y_log10() +
labs(title = "Cumulative COVID-19 Cases and Deaths in the US",
y = "Count (log scale)",
color = "Metric") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 90, hjust = 1))
This plot shows new COVID-19 cases and deaths per day in the US.
us_totals %>%
ggplot(aes(x = date)) +
geom_line(aes(y = new_cases, color = "New Cases")) +
geom_point(aes(y = new_cases, color = "New Cases")) +
geom_line(aes(y = new_deaths, color = "New Deaths")) +
geom_point(aes(y = new_deaths, color = "New Deaths")) +
scale_y_log10() +
labs(title = "Daily New COVID-19 Cases and Deaths in the US",
y = "Count (log scale)",
color = "Metric") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 90, hjust = 1))
US States: Cases and Deaths Per Thousand - Aggregate cases and deaths per thousand by US state. Display states with highest and lowest death rates.
us_state_totals = us_by_state %>%
group_by(Province_State) %>%
summarize(deaths=max(deaths), cases=max(cases),
population=max(population),
cases_per_thou = 1000*cases/population,
deaths_per_thou = 1000*deaths/population) %>%
filter(cases>0, population>0)
us_state_totals %>%
slice_min(deaths_per_thou, n=10) %>%
select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 6
## deaths_per_thou cases_per_thou Province_State deaths cases population
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0.611 150. American Samoa 34 8.32e3 55641
## 2 0.744 248. Northern Mariana Isl… 41 1.37e4 55144
## 3 1.21 231. Virgin Islands 130 2.48e4 107268
## 4 1.30 269. Hawaii 1841 3.81e5 1415872
## 5 1.49 245. Vermont 929 1.53e5 623989
## 6 1.55 293. Puerto Rico 5823 1.10e6 3754939
## 7 1.65 340. Utah 5298 1.09e6 3205958
## 8 2.01 415. Alaska 1486 3.08e5 740995
## 9 2.03 252. District of Columbia 1432 1.78e5 705749
## 10 2.06 253. Washington 15683 1.93e6 7614893
us_state_totals %>%
slice_min(deaths_per_thou, n=10) %>%
select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 6
## deaths_per_thou cases_per_thou Province_State deaths cases population
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0.611 150. American Samoa 34 8.32e3 55641
## 2 0.744 248. Northern Mariana Isl… 41 1.37e4 55144
## 3 1.21 231. Virgin Islands 130 2.48e4 107268
## 4 1.30 269. Hawaii 1841 3.81e5 1415872
## 5 1.49 245. Vermont 929 1.53e5 623989
## 6 1.55 293. Puerto Rico 5823 1.10e6 3754939
## 7 1.65 340. Utah 5298 1.09e6 3205958
## 8 2.01 415. Alaska 1486 3.08e5 740995
## 9 2.03 252. District of Columbia 1432 1.78e5 705749
## 10 2.06 253. Washington 15683 1.93e6 7614893
This map shows total reported cases by US state. California has reported the maximum cases in the country.
us_state_plot_data = us_state_totals %>%
select(Province_State, cases) %>%
rename(state = "Province_State")
plot_usmap(data = us_state_plot_data,
values = "cases",
color = "blue",
labels = TRUE) +
scale_fill_continuous(name = "Total Cases", label = scales::comma) +
labs(title = "Total COVID-19 Cases by US State") +
theme(panel.background = element_rect(color = "black", fill = "lightblue"))
This map shows total deaths reported in each US state. The below spatial analysis shows that most number of deaths have been reported in California followed by Texas. The least number of deaths are reported in American Samoa.
us_state_deaths_plot_data = us_state_totals %>%
select(Province_State, deaths) %>%
rename(state = "Province_State")
plot_usmap(data = us_state_deaths_plot_data,
values = "deaths",
color = "red",
labels = TRUE) +
scale_fill_continuous(name = "Total Deaths", label = scales::comma) +
labs(title = "Total COVID-19 Deaths by US State") +
theme(panel.background = element_rect(color = "black", fill = "lightblue"))
The following map visualizes total COVID-19 cases per country.
global_plot_data = global_totals %>%
select(Country_Region, cases) %>%
rename(country = "Country_Region")
CASES <- global_plot_data %>%
group_by(country) %>%
summarise(`Total cases` = sum(cases)) %>%
mutate(country = gsub("_", " ", country))
COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0, 0, 1, 0))
mapCountryData(COVID.map, nameColumnToPlot = "Total cases", colourPalette = 'topo')
The below graph shows the deaths caused by Covid 19 for all of the countries in the world.
global_deaths_plot_data = global_totals %>%
select(Country_Region, deaths) %>%
rename(country = "Country_Region")
CASES <- global_deaths_plot_data %>%
group_by(country) %>%
summarise(`Total deaths` = sum(deaths)) %>%
mutate(country = gsub("_", " ", country))
COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total deaths", colourPalette = 'heat')
The below graph shows the recoveries from Covid 19 for all of the countries in the world.
global_recovered_plot_data = global_totals %>%
select(Country_Region, recovered) %>%
rename(country = "Country_Region")
CASES <- global_recovered_plot_data %>%
group_by(country) %>%
summarise(`Total recovered` = sum(recovered)) %>%
mutate(country = gsub("_", " ", country))
COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total recovered", colourPalette = 'terrain')
The below graph shows the total active cases of Covid 19 for all of the countries in the world.
global_active_plot_data = global_totals %>%
select(Country_Region, active) %>%
rename(country = "Country_Region")
CASES <- global_active_plot_data %>%
group_by(country) %>%
summarise(`Total active` = sum(active)) %>%
mutate(country = gsub("_", " ", country))
COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total active", colourPalette = 'rainbow')
The COVID 19 data is modeled to predict the number of deaths with relation to all of the cases reported in us and world. We use linear regression, SVM, and neural network models to predict COVID-19 death rates based on cases and recoveries, comparing model performance.
Predicting deaths per thousand using cases per thousand (US state-level data). Linear Regression tries to use one or more factors to identify linear relationship with the resultant value.
mod = lm(deaths_per_thou ~ cases_per_thou, data=us_state_totals)
summary(mod)
##
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou, data = us_state_totals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3352 -0.5978 0.1491 0.6535 1.2086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.36167 0.72480 -0.499 0.62
## cases_per_thou 0.01133 0.00232 4.881 9.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8615 on 54 degrees of freedom
## Multiple R-squared: 0.3061, Adjusted R-squared: 0.2933
## F-statistic: 23.82 on 1 and 54 DF, p-value: 9.763e-06
The below graph shows the linear regression for us state data. It compares the actual data points and the ones predicted using the model.
us_total_w_pred = us_state_totals %>%
mutate(pred=predict(mod))
us_total_w_pred %>%
ggplot() +
geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
labs(
title = "Linear Regression: Observed vs. Predicted Deaths per Thousand (US States)",
x = "Cases per Thousand",
y = "Deaths per Thousand"
) +
theme_minimal()
The below Linear regression model predicts the death rate using the total cases for all the countries across the globe. The graph shows the predicted and actual values for global dataset.
mod = lm(deaths_per_thou ~ cases_per_thou, data=global_totals)
summary(mod)
##
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou, data = global_totals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4069 -0.6090 -0.3898 0.4689 5.5423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6277935 0.1116605 5.622 6.57e-08 ***
## cases_per_thou 0.0035877 0.0004393 8.167 4.13e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.151 on 192 degrees of freedom
## Multiple R-squared: 0.2578, Adjusted R-squared: 0.254
## F-statistic: 66.7 on 1 and 192 DF, p-value: 4.127e-14
global_totals_w_pred = global_totals %>%
mutate(pred=predict(mod))
global_totals_w_pred %>%
ggplot() +
geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
labs(
title = "Linear Regression: Observed vs. Predicted Deaths per Thousand (Global)",
x = "Cases per Thousand",
y = "Deaths per Thousand"
) +
theme_minimal()
The below Linear regression model predicts the death rate using the total cases and recovery rate for all the countries accross the globe. It also shows the 3 dimensional graphs where the predicted and actual values are shown in 2 different graphs.
mod = lm(deaths_per_thou ~ cases_per_thou + recovered_per_thou, data=global_totals)
summary(mod)
##
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou + recovered_per_thou,
## data = global_totals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7061 -0.4499 -0.3471 0.4326 4.9949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4450801 0.1035210 4.299 2.73e-05 ***
## cases_per_thou 0.0018710 0.0004654 4.020 8.36e-05 ***
## recovered_per_thou 0.0152332 0.0022007 6.922 6.58e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.031 on 191 degrees of freedom
## Multiple R-squared: 0.4067, Adjusted R-squared: 0.4005
## F-statistic: 65.46 on 2 and 191 DF, p-value: < 2.2e-16
global_totals_w_pred = global_totals %>%
mutate(pred=predict(mod))
global_totals_w_pred %>%
ggplot() +
geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
labs(
title = "Multiple Linear Regression: Deaths per Thousand vs. Cases and Recovery",
x = "Cases per Thousand",
y = "Deaths per Thousand"
) +
theme_minimal()
plot_ly(data = global_totals_w_pred, x = ~cases_per_thou, y = ~recovered_per_thou, z = ~pred,
type = 'scatter3d', mode = 'markers',
marker = list(size = 5, color = ~pred, colorscale = 'Viridis')) %>%
layout(scene = list(xaxis = list(title = 'Cases per Thousand'),
yaxis = list(title = 'Recovered Per Thousand'),
zaxis = list(title = 'Predicted Deaths per Thousand')))
plot_ly(data = global_totals_w_pred, x = ~cases_per_thou, y = ~recovered_per_thou, z = ~deaths_per_thou,
type = 'scatter3d', mode = 'markers',
marker = list(size = 5, color = ~deaths_per_thou, colorscale = 'Viridis')) %>%
layout(scene = list(xaxis = list(title = 'Cases per Thousand'),
yaxis = list(title = 'Recovered Per Thousand'),
zaxis = list(title = 'Deaths per Thousand')))
Model Performance Metrics: The below graph shows the performance of the model. It reflects the actual and predicted values with a Root Mean Squared Value of 1.14
rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
actual_values <- global_totals_w_pred$deaths_per_thou
predicted_values <- global_totals_w_pred$pred
print(paste("RMSE:", rmse(actual_values, predicted_values)))
## [1] "RMSE: 1.02340386994039"
print(paste("R2:", caret::R2(predicted_values, actual_values)))
## [1] "R2: 0.406675841326311"
plot(actual_values, type = "l", col = "blue", main = "Actual vs Predicted Values", xlab = "Index", ylab = "Deaths per Thousand")
lines(predicted_values, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)
The below code uses SVM model using the total cases and recoveries for covid 19 for global data. The performance of this model is not good as compared to Linear Regression because of high RMSE.
regressor <- svm(deaths_per_thou ~ cases_per_thou + recovered_per_thou,
data = global_totals,
kernel = "linear", cost = 10, scale = FALSE)
predictions <- predict(regressor, newdata = global_totals)
print(paste("SVM RMSE:", rmse(global_totals$deaths_per_thou, predictions)))
## [1] "SVM RMSE: 1.31319395596001"
plot(global_totals$deaths_per_thou, type = "l", col = "blue", main = "SVM: Actual vs Predicted", xlab = "Index", ylab = "Deaths per Thousand")
lines(predictions, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)
The below code uses neural network model using the total cases and recoveries for covid 19 for global data. The performance of this model is best as it has the lowest Root Mean Squared value among the models tested as part of this report.
nn = nnet(deaths_per_thou ~ cases_per_thou + recovered_per_thou, data = global_totals, size = 10, maxit = 1000, linout = TRUE)
## # weights: 41
## initial value 353.561299
## iter 10 value 254.469143
## iter 20 value 190.203043
## iter 30 value 159.484947
## iter 40 value 149.918189
## iter 50 value 145.741366
## iter 60 value 145.067310
## iter 70 value 144.214227
## iter 80 value 141.954202
## iter 90 value 131.189503
## iter 100 value 125.918014
## iter 110 value 120.917149
## iter 120 value 117.646201
## iter 130 value 111.305433
## iter 140 value 107.328426
## iter 150 value 101.221107
## iter 160 value 99.648573
## iter 170 value 98.911198
## iter 180 value 98.424001
## iter 190 value 97.529997
## iter 200 value 97.340605
## iter 210 value 97.331423
## iter 220 value 97.327366
## iter 230 value 97.318121
## iter 240 value 97.317079
## iter 250 value 97.316419
## final value 97.316411
## converged
predictions <- predict(nn, newdata = global_totals)
print(paste("Neural Network RMSE:", rmse(global_totals$deaths_per_thou, predictions)))
## [1] "Neural Network RMSE: 0.708259123738405"
plot(global_totals$deaths_per_thou, type = "l", col = "blue", main = "Neural Network: Actual vs Predicted", xlab = "Index", ylab = "Deaths per Thousand")
lines(predictions, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)
While the data from Johns Hopkins University aggregates multiple sources, some cases may go unreported due to limitations in data availability and cross-country reporting standards. Data coverage is more comprehensive for the US compared to other countries, which could introduce bias.
On the analysis side, variable selection (cases, deaths, recovered) and certain model choices reflect analyst bias in focusing on specific aspects of the pandemic. Visualization and mapping may omit regions due to missing identifiers.
This report provides an overview and modeling of COVID-19 incidence and mortality trends both in the US and globally. Countries and US states with the highest and lowest case and death rates were identified. Neural Network models offered the best predictive accuracy for deaths per thousand among models tested. The analysis highlights patterns that can assist with public health planning and response, but should always be interpreted with awareness of data and modeling limitations.